Abstract 

We investigated the statistical properties of the K-index (Holland, 1996) that can be 
used to detect copying behavior on a test. A simulation study was conducted to investigate 
the applicability of the K-index for small, medium, and large datasets. Furthermore, the 
Type I error rate and the detection rate of this index were compared with those of the 
copying index ω (Wollack, 1997). Several approximations were used to calculate the K-index. 
Results showed that all approximations were able to hold the Type I error rates below the 
nominal level. Results further showed that using ω resulted in higher detection rates than 
the K-indices for small and medium sample sizes (100 and 500 simulees). 

Key Words: IRT, nominal response model, copying indices, cheating 

The variety of methods to cheat on educational tests seems to be limited only by 
one’s imagination. In his book on cheating on tests, Cizek (1999, Chap. 3) gives an 
overview of several cheating methods. Among the methods discussed are using forbidden 
materials, circumventing the testing process, and even using microrecorders. 

In the present study, we are concerned with a form of cheating that has received 
some attention in the recent literature, namely, answer copying. In this type of cheating, 
one examinee copies the answers from another examinee. The copier may copy from an 
examinee who is seated nearby, but answers may also be transmitted through a prearranged 
code, for example, clicking of pens, tapping of the foot, and the like. Thus the 
examinees do not have to be physically close to each other. Because answer 
copying may invalidate an examinee’s test score, it is necessary to prevent such practices 
by using well-instructed proctors and by arranging the seating so that there is 
ample room between the examinees. When a proctor nevertheless observes irregularities, 
statistical methods may be used to obtain additional evidence of answer copying. 

Several methods have been proposed, all of which are based on determining the probability 
that the observed score patterns of two examinees under suspicion are similar. If this 
probability is high, this may indicate that one examinee copied the answers from another 
examinee. These chance methods can be classified into two types (Cizek, 1999, pp. 
138-139). One type of method compares an observed pattern of responses to a known 
theoretical distribution (e.g., Frary, Tideman, & Watts, 1977; Wollack, 1997). In the 
second type of method, the probability of an observed pattern is compared with a 
distribution of values derived from independent pairs of students who took the same test. 
An example of such a statistic is the K-index (Holland, 1996). 

In this paper we will investigate the statistical properties and the detection rate of the 
K-index, which thus far has been described only in a paper by Holland (1996) and applied to a 
few empirical datasets from Educational Testing Service (ETS). As Cizek (1999) noted, 
no comparative studies of the performance of this index are known, so it is unknown 
whether it performs better, worse, or the same as the other available indices. In this paper 



Statistical Properties of the K-Index - 4 



we will investigate the statistical properties of the K-index and, in particular, the 
small-sample properties of this index. Furthermore, we will compare the detection rate of 
this index with that of the index ω proposed by Wollack (1997). The major difference 
between the indices is that the K-index does not assume any test model, whereas ω is based 
on item response theory modeling (e.g., van der Linden & Hambleton, 1997). 

This study is organized as follows. First, we will discuss the rationale behind the 
K-index and discuss several methods proposed by Holland (1996) to calculate this index. 
Second, we will discuss some practical problems that arise when this index is applied 
and we will propose two new methods to calculate this index. Third, we will 
conduct a simulation study to investigate the statistical properties of this index. Finally, 
we will conduct a simulation study in which we compare the Type I error rate and detection 
rate of the K-index with those of the ω statistic. 

The K-index 

The K-index is a statistic that can be used to assess the degree of unusual agreement 
between the incorrect answers on a multiple-choice test of two examinees, one referred 
to as the source (s) and the other as the copier (c). The copier is suspected of copying 
answers from the source. Note that the K-index only takes the incorrect answers of the 
examinees into account. For a rationale behind this strategy, see Holland (1996). 

Notation 

The following notation will be used throughout the text. Let 

j (j = 1, ..., J) denote examinees, 

i (i = 1, ..., I) denote items, 

v (v = 1, ..., V) denote the item response categories, 

s denote an examinee identified as the source, 

c denote an examinee suspected of copying answers from s, 

$W_j$ denote the number of “wrong” answers of examinee j, 

M with realization m denote the number of matching wrong answers between examinee 
j and s, 

r = 1, ..., d, ..., R denote subgroups of examinees, where each subgroup has a distinct 
number of wrong answers and d is the subgroup to which examinee c belongs, 

j' = 1, ..., $n_r$ denote an examinee in subgroup r, where each subgroup has at least one 
examinee and $\sum_{r=1}^{R} n_r = J - 1$, 

$M_r = (M_{r1}, \ldots, M_{rj'}, \ldots, M_{rn_r})$ denote the vector of the numbers of 
matching wrong answers in a particular subgroup r, 

$M_d = (M_{d1}, \ldots, M_{dn_d})$ denote the vector of the numbers of matching wrong 
answers of the $n_d$ examinees in subgroup d, where subgroup d consists of the examinees 
with the same number-incorrect score as the copier, 

and let $Q_r = W_r / I$ denote the proportion of wrong answers of subgroup r, where $W_r$ 
is the number of wrong answers of the examinees in subgroup r and I is the total number 
of items in the test. 

K-index Based on the Empirical Distribution 

The K-index can be determined using empirical data of J persons answering I items. 
To calculate the K-index based on the empirical data, we first determine the group of 
examinees with the same number-incorrect score as the copier (subgroup d), and then for 
each of the examinees in subgroup d we determine the number of items that match 
the incorrect answers of the source. This is the vector $M_d$, and the distribution of 
$M_d$ comprises the empirical agreement distribution. For examinee c, we specifically denote 
$m_{dc}$ as the number of matching wrong answers between c and s. The random variable 
$M_{rj'}$ is denoted as M if it is not necessary to identify the group membership of the 
examinee. The K-index is defined as the proportion of examinees having the same 
number-incorrect score as c whose number of matching incorrect item scores with s is at 
least as large as that of c. 

For j' = 1, ..., $n_d$, let $I_{dj'}$ denote an indicator variable, coded as 1 if 
$m_{dj'} \ge m_{dc}$ and 0 otherwise. Then K is defined as 

$$K = \frac{\sum_{j'=1}^{n_d} I_{dj'}}{n_d}. \tag{1}$$

The idea is that when K is very small there is statistical evidence that examinee c 
copied from examinee s. 
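
The computation in equation (1) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the program used in this study: the function names, the list-of-lists layout of the responses, and the toy data are all hypothetical.

```python
from typing import Sequence


def n_wrong(resp: Sequence[int], key: Sequence[int]) -> int:
    """Number-incorrect score W_j of one examinee."""
    return sum(1 for r, k in zip(resp, key) if r != k)


def matching_wrong(resp: Sequence[int], source: Sequence[int], key: Sequence[int]) -> int:
    """Number of items on which the examinee gives the same wrong answer as the source."""
    return sum(1 for r, s, k in zip(resp, source, key) if r == s and s != k)


def empirical_k(responses, key, source_idx: int, copier_idx: int) -> float:
    """Empirical K-index: share of subgroup d with at least m_dc matching wrong answers."""
    source = responses[source_idx]
    copier = responses[copier_idx]
    w_c = n_wrong(copier, key)
    m_dc = matching_wrong(copier, source, key)
    # Subgroup d: every examinee except the source with the copier's number-incorrect score.
    subgroup = [r for j, r in enumerate(responses)
                if j != source_idx and n_wrong(r, key) == w_c]
    hits = sum(1 for r in subgroup if matching_wrong(r, source, key) >= m_dc)
    return hits / len(subgroup)


# Toy data (assumed): four items, answer key 1 on every item.
KEY = [1, 1, 1, 1]
RESPONSES = [
    [2, 2, 3, 1],  # source, three wrong answers
    [2, 2, 1, 4],  # suspected copier, three wrong answers, two matches
    [3, 2, 1, 5],  # three wrong answers, one match
    [4, 5, 1, 3],  # three wrong answers, no match
    [1, 1, 1, 2],  # one wrong answer, not in subgroup d
]
k_value = empirical_k(RESPONSES, KEY, source_idx=0, copier_idx=1)
```

Because the copier belongs to subgroup d, the value returned by this sketch can never fall below $1/n_d$, which illustrates why small subgroups limit the attainable significance levels.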

Note that, in general, the number of matching incorrect scores depends on the ability 
level of s and c. The number of matching incorrect answers is necessarily small when 
either s or c, or both have many correct scores (high ability), whereas it is large when both 
examinees have many wrong answers (low ability). In order to minimize the dependency 
of M on the ability level of the population of examinees, the K-index is computed 
conditional on the number of incorrect scores of the suspected copier. As a consequence, 
the number of examinees involved in the actual computation of the K-index (those in 
subgroup d) can become very small. We emphasize this because the number of examinees in a 
subgroup influences the accuracy of the value of the K-index. When the sample size is small 
(J = 100), one alternative is to use a theoretical approximation to the empirical agreement 
distribution. 

K-index Based on Theoretical Approximations 

To use the K-index, one first has to specify the Type I error rate (α), which is defined as 
the probability of misclassifying an honest examinee as a copier. Ideally, we would like to 
have a statistic for which the nominal and empirical Type I error rates are similar. Note 
that in this type of statistical application, the main concern is to have a statistic that 
is not liberal, that is, a statistic for which the empirical Type I error rate is at most as 
large as the nominal Type I error rate, because the consequence of misclassifying an honest 
examinee as a copier can be very serious at the individual level. 

Seaman et al. (1991; see also Wollack, 1997) argued that copying indices that fail to 
hold the nominal Type I error rate should be considered unacceptable. On the other hand, 
the copying index should not be overly conservative; otherwise, the power of the copying 
index to detect true copiers will be very low. 

In general, a disadvantage of using the discrete empirical distribution in small 
samples is that the random variable M can only take a small number of values. As a 
result, it is often not possible to obtain a prespecified Type I error rate of, say, .05 
(Agresti, 1996, p. 43). 

Holland (1996) noted that the distribution of M can be approximated by the binomial 
distribution, that is, $M \overset{\text{ap}}{\sim} B(w_s, p)$, where $w_s$, the number of 
wrong answers of the source, is known but p is unknown. Holland (1996) suggested two ways 
of approximating p. In the first approach, p is computed such that the binomial distribution 
and the empirical distribution of M have the same mean. Let $\bar{m}_d$ denote the mean of 
the empirical agreement distribution, which equals 

$$\bar{m}_d = \frac{\sum_{j'=1}^{n_d} m_{dj'}}{n_d}. \tag{2}$$

Then an estimate of p, denoted as $p^*_d$, is defined as 

$$p^*_d = \frac{\bar{m}_d}{w_s}. \tag{3}$$

Let K* denote the K-index based on $p^*_d$; then K* is given by 

$$K^* = P(M \ge m_{dc}) = \sum_{g=m_{dc}}^{w_s} \binom{w_s}{g} (p^*_d)^g (1 - p^*_d)^{w_s - g}. \tag{4}$$

Holland (1996) showed, using large empirical datasets, that the binomial distribution 
with $p^*_d$ yielded a “conservative” estimate of the empirical agreement distribution. That 
is, the K-index based on the binomial approximation is often stochastically higher than 
the K-index based on the empirical distribution (Agresti, 1990, p. 9). 
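
As an illustration of the binomial approximation, the following Python sketch matches the binomial mean to the mean of the empirical agreement distribution and then sums the upper tail. The function names and the toy numbers are assumptions made for the example; the sketch is not taken from Holland (1996) or from the program used in this study.

```python
from math import comb


def binomial_upper_tail(m: int, n: int, p: float) -> float:
    """P(M >= m) for M ~ B(n, p)."""
    return sum(comb(n, g) * p ** g * (1 - p) ** (n - g) for g in range(m, n + 1))


def k_star(matches_in_subgroup, m_dc: int, w_s: int) -> float:
    """Binomial K-index with p chosen so that w_s * p equals the empirical mean."""
    mean_matches = sum(matches_in_subgroup) / len(matches_in_subgroup)
    p_star = mean_matches / w_s
    return binomial_upper_tail(m_dc, w_s, p_star)


# Toy example (assumed): the source has 3 wrong answers, the three examinees in the
# copier's subgroup match it on 0, 1, and 2 items, and the copier matches on 2.
example = k_star([0, 1, 2], m_dc=2, w_s=3)
```

In this toy case the mean number of matches is 1, so p is 1/3 and the upper tail is 7/27, which is smaller than the smallest value (1/3) that the empirical K-index can take in a subgroup of three.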

To calculate $p^*_d$, the response patterns of the examinees in subgroup d must be 
available. Furthermore, the value of $p^*_d$ is affected by the sample size: the smaller 
the sample size, the less reliable the estimate of $p^*_d$. Holland (1996) suggested 
approximating $p^*_d$ through linear regression by utilizing the proportion of wrong answers 
($Q_r$) of the examinees in each number-incorrect score subgroup r = 1, ..., R. Using 
large datasets from ETS, Holland (1996) showed empirically that $p^*_r$, where $p^*_r$ is 
defined analogously to $p^*_d$ in equation (3), is linearly related to $Q_r$. Let $\hat{p}_r$ 
be the estimate of the binomial probability $p^*_r$ using $Q_r$. The expression for 
$\hat{p}_r$ is given as a piecewise linear function with a and b as the intercept and slope 
parameters, respectively: 

$$\hat{p}_r = \begin{cases} a + bQ_r & \text{if } 0 < Q_r \le 0.3, \\ [a + 0.3b] + 0.4b[Q_r - 0.3] & \text{if } 0.3 < Q_r \le 1. \end{cases} \tag{5}$$

Note that a and b have to be specified in order to compute $\hat{p}_r$ in equation (5). 
Holland (1996) used a = 0.085 and different values for b depending on the particular test 
that was used. However, from his study it is unclear how these values were obtained. 
Moreover, they may vary across different tests. 
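
A short Python sketch of the piecewise linear function in equation (5) is given below. The intercept a = 0.085 is the value reported above; the slope b = 0.5 is purely hypothetical and serves only to make the example runnable.

```python
def piecewise_p(q_r: float, a: float = 0.085, b: float = 0.5) -> float:
    """Piecewise linear estimate of the binomial probability from Q_r.

    The default for b is an arbitrary illustrative value, not one used by Holland (1996).
    """
    if not 0.0 < q_r <= 1.0:
        raise ValueError("Q_r must lie in (0, 1]")
    if q_r <= 0.3:
        return a + b * q_r
    # Above Q_r = 0.3 the slope is reduced to 0.4 * b; the two pieces meet at 0.3.
    return (a + 0.3 * b) + 0.4 * b * (q_r - 0.3)


low = piecewise_p(0.2)   # first branch
high = piecewise_p(0.5)  # second branch
```

The two branches agree at $Q_r$ = 0.3, so the function is continuous, but the estimate remains sensitive to the choice of b.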

In the present study, we propose $p^*_1$ and $p^*_2$ as estimates of p* based on a linear 
and a quadratic regression approach, respectively. Based on these estimates of p*, two 
versions of the K-index, $\bar{K}_1$ and $\bar{K}_2$, are defined as 

$$\bar{K}_1 = P(M \ge m_{dc}) = \sum_{g=m_{dc}}^{w_s} \binom{w_s}{g} (p^*_1)^g (1 - p^*_1)^{w_s - g} \tag{6}$$

and 

$$\bar{K}_2 = P(M \ge m_{dc}) = \sum_{g=m_{dc}}^{w_s} \binom{w_s}{g} (p^*_2)^g (1 - p^*_2)^{w_s - g}. \tag{7}$$

Note that only those examinees belonging to subgroup d are used to estimate p by 
$p^*_d$. On the other hand, $p^*_1$ and $p^*_2$ use relevant information from all R 
subgroups. Therefore, $p^*_1$ and $p^*_2$ are expected to provide better estimates of p 
than $p^*_d$. 

The main aim of this study is to explore the usefulness of the K-index and its 
approximations given in equations (4), (6), and (7) under varying testing conditions. 
First, we will investigate whether the linear relationship between p* and $Q_r$ found by 
Holland (1996) also applies to relatively small datasets. Second, we will investigate the 
fit of the binomial distribution using $p^*_d$, $p^*_1$, and $p^*_2$ as an approximation to 
the distribution of M. Finally, we will determine the empirical Type I error rates and 
detection rates of the K-index and the ω statistic (Wollack, 1997). Because we will use ω 
to evaluate the performance of the K-index, we will introduce this statistic first. 

The ω statistic 

Wollack (1997) proposed the ω copying index, which is formulated in the context of 
the nominal response model (NRM; Bock, 1972). To determine ω, the NRM is used to 
estimate the probability that an examinee responds to each of the item response categories 
v = 1, ..., h, ..., V. Under the NRM, the probability that examinee j with ability level 
$\theta_j$ responds to option h of item i, with intercept and slope parameters $\zeta_{ih}$ 
and $\lambda_{ih}$, is given as 

$$P_{ih}(\theta_j) = \frac{\exp(\zeta_{ih} + \lambda_{ih}\theta_j)}{\sum_{v=1}^{V} \exp(\zeta_{iv} + \lambda_{iv}\theta_j)}. \tag{8}$$
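
The category probabilities in equation (8) can be computed as in the following Python sketch. The function name and the parameter values are hypothetical; subtracting the largest exponent before exponentiating is a common numerical safeguard and does not change the result.

```python
from math import exp


def nrm_probs(theta: float, zeta, lam):
    """Category probabilities of one item under the nominal response model."""
    logits = [z + l * theta for z, l in zip(zeta, lam)]
    top = max(logits)  # shift by the maximum to avoid overflow in exp()
    weights = [exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical intercepts and slopes for one five-option item.
ZETA = [0.2, -0.1, 0.4, -0.3, -0.2]
LAM = [-0.5, -0.2, 1.1, -0.3, -0.1]
probs = nrm_probs(0.5, ZETA, LAM)
```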

Let $h_{cs}$ be the number of identically answered items of s and c, and let 
$E(h_{cs} \mid \theta_c, U_s, \xi)$ be the expected value of $h_{cs}$ conditional on the 
ability level of the copier ($\theta_c$), the item response vector of the source ($U_s$), 
and the item parameters ($\xi$). Furthermore, let $\sigma_{h_{cs}}$ be the standard 
deviation of $h_{cs}$. Then ω is given by 

$$\omega = \frac{h_{cs} - E(h_{cs} \mid \theta_c, U_s, \xi)}{\sigma_{h_{cs}}}, \tag{9}$$

where 

$$E(h_{cs} \mid \theta_c, U_s, \xi) = \sum_{i=1}^{I} P(u_{ic} = u_{is} \mid \theta_c, U_s, \xi).$$

Using the NRM, the probabilities of c selecting the responses of s can be determined. 
For any pair of examinees s and c, the distribution of ω approaches the standard normal 
(Wollack, 1997) as the number of test items becomes infinitely large. Thus, the ω values 
can be evaluated for statistical significance using the standard normal distribution. 
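
A minimal Python sketch of equation (9) follows. It takes as input the model-based probabilities that the copier selects the source's response on each item. The standard deviation is computed here as the square root of the sum of P(1 - P) over items, which treats the item matches as independent Bernoulli events; the text above does not spell out this formula, so it should be read as an assumption of the sketch, as should the function names and the toy numbers.

```python
from math import sqrt
from statistics import NormalDist


def omega(match_probs, h_cs: int) -> float:
    """Standardized number of identical answers between copier and source."""
    expected = sum(match_probs)
    # Assumed: item matches are independent Bernoulli events given theta_c and U_s.
    sd = sqrt(sum(p * (1 - p) for p in match_probs))
    return (h_cs - expected) / sd


def upper_tail_p(omega_value: float) -> float:
    """One-tailed p-value under the standard normal distribution."""
    return 1 - NormalDist().cdf(omega_value)


# Toy example (assumed): four items, each with match probability 0.5, all four matched.
w = omega([0.5, 0.5, 0.5, 0.5], h_cs=4)
p_value = upper_tail_p(w)
```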

The ω statistic is very similar to the $g_2$ index proposed by Frary et al. (1977). 
The main difference is in the way the expected value of $h_{cs}$ is computed; ω uses the 
nominal response model conditional on $\theta_c$, $U_s$, and $\xi$, whereas $g_2$ uses item 
distractors and difficulties from classical test theory and the ratio of the copier’s 
number-correct score to the mean number-correct score for all examinees. 

Wollack (1997) compared the empirical Type I error rates and the power of ω and $g_2$. 
The results showed that ω performed better than $g_2$ in detecting answer copying under 
the conditions simulated. In particular, $g_2$ failed to maintain the nominal Type I error 
rate and was found to be too liberal in all circumstances. Therefore, in this study, the 
empirical Type I error and detection rates of the K-index were compared with those of ω. 

Although both the K-index and ω make use of item response similarities, ω compares 
the responses of the copier to the entire response vector of the source, whereas in the 
K-index, the incorrect responses of the copier are compared with the incorrect responses of 
the source. Wollack (1996, p. 13) pointed out that the power of a statistic that does not 
take into account the information from correctly answered items is likely to be reduced 
due to a reduction in the number of operational items used. Besides, examinees who are 
most likely to be caught are those who miss several items. He added that “it is often 
not worthwhile to pursue a cheating claim if the alleged copier received a low score”; this 
is an argument against a copying index that disregards correctly answered items, such as 
the K-index. 

The ω statistic is based on IRT modeling, in particular the nominal response model. 
First, it is reasonable to assume that the fit of the model to the data is important for 
the ω statistic to perform well. Second, if the suspected examinee copied a considerable 
number of items from the source, the ability level of the copier will be overestimated, 
which consequently affects the value of ω. Finally, the estimation of the item parameters 
used in the NRM requires a large number of examinees (Wollack, 1997), a requirement that 
may restrict the usefulness of this index in cases where large datasets are not available, 
although Wollack (1998) showed that estimating the item parameters on samples as 
small as 100 examinees for 40- and 80-item tests did not result in an increase in Type I 
error or a significant loss in power. 

The K-index, on the other hand, does not assume any IRT model and is therefore easier 
to apply in practice. However, a drawback of this index is that the number of examinees 
in each score group based on the number-incorrect scores should be large enough to 
obtain a reliable estimate of the binomial p. For example, when simulating 10 times a 
test consisting of 40 items and drawing θ from the standard normal distribution for 30 
simulees, the number of score groups ranged from 19 through 22, with the number of score 
groups containing only 1 simulee ranging from 12 through 15 (60-74%) and the other score 
groups consisting of only 2 or 3 simulees. Thus, p is estimated very unreliably for such 
samples. 

Method 



Data Generation 

The NRM was used to generate item scores on multiple-choice tests with five options. 
Test lengths were 40 and 80 items and the numbers of simulees in the sample were 100, 500, 
and 2000. These numbers were chosen to reflect small, medium, and large sample sizes. 
To be able to compare the results in this study with the results obtained by Wollack (1997), 
the same item parameters were chosen as in his study, which were based on empirical data 
of a mathematics college placement test. Similarly, the ability parameter, $\theta_j$, was 
drawn from N(0, 1). Given the item and ability parameters, $P_{ih}(\theta_j)$ was computed 
for all i, h, and j, using equation (8). 

Items with five answer categories were considered. The observed response of 
examinee j to item i was obtained by drawing a sample from the set v = {1, ..., 5}, where 
the elements of v have probabilities of being drawn equal to $P_{i1}(\theta_j), \ldots, 
P_{i5}(\theta_j)$, respectively. In the NRM, the category with the largest algebraic value 
for λ has a monotonically increasing response function. As in other studies (e.g., Thissen & 
Steinberg, 1997), this category was chosen as the keyed alternative. 
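
The response-generation step can be sketched in Python as follows. The item parameters below are hypothetical placeholders, not the parameters of the mathematics placement test used in this study, and the function names and seed are likewise assumptions made for the illustration.

```python
import random
from math import exp


def nrm_probs(theta, zeta, lam):
    """Category probabilities of one item under the nominal response model."""
    logits = [z + l * theta for z, l in zip(zeta, lam)]
    top = max(logits)  # shift by the maximum for numerical stability
    weights = [exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def keyed_option(lam):
    """1-based index of the category with the largest slope (the keyed alternative)."""
    return max(range(len(lam)), key=lambda v: lam[v]) + 1


def draw_response(theta, zeta, lam, rng):
    """Draw one observed response from {1, ..., V} with the NRM probabilities."""
    probs = nrm_probs(theta, zeta, lam)
    options = list(range(1, len(probs) + 1))
    return rng.choices(options, weights=probs, k=1)[0]


# Hypothetical parameters for a single five-option item.
ZETA = [0.2, -0.1, 0.4, -0.3, -0.2]
LAM = [-0.5, -0.2, 1.1, -0.3, -0.1]

rng = random.Random(2024)          # fixed seed so the sketch is reproducible
theta_j = rng.gauss(0.0, 1.0)      # ability drawn from N(0, 1)
responses = [draw_response(theta_j, ZETA, LAM, rng) for _ in range(10)]
```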

Simulation of Copying 

To simulate copying, s and c were identified based on their ability percentile rank. 
Because in practice we are mainly interested in obtaining additional statistical evidence 
of answer copying for examinees who raise their scores by copying answers from an 
examinee with higher ability, we chose c such that the ability percentile rank of c was 
lower than that of s. This was also done to reflect the fact that the source is often a 
person with a higher ability level than the copier (Holland, 1996). Simulees were first 
ordered according to θ. Then, in each dataset, the source was selected as the simulee at the 
90th or 60th percentile rank. In each dataset, 5% of the simulees were randomly selected as 
copiers from the simulees with a θ level below the θ level of the source. 

Similar to Wollack (1997), copying was simulated by randomly selecting items and then 
altering the responses of c to match the responses of s. This was done as 
follows. First, n% (10%, 20%, 30%, or 40%) of the items were randomly selected, and 
then the item scores of c on these items were changed to match the item scores of s. For 
both the 40-item and the 80-item tests, 10%, 20%, 30%, and 40% of the item scores were 
changed, corresponding to 4, 8, 12, and 16 items in the 40-item test and 8, 16, 24, and 32 
items in the 80-item test. The four factors, sample size (3 levels), number of items (2 
levels), ability level of the source (2 levels), and percentage of items copied (4 levels), 
were completely crossed to simulate 48 testing conditions. A program in S-PLUS (S-PLUS 
2000, MathSoft Inc.) was written by the authors that performed the required simulation 
and necessary routine calculations. 
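
The copying manipulation and the crossing of the four factors can be illustrated with the short Python sketch below. It is not the S-PLUS program mentioned above; the function name, the constant names, and the toy response vectors are assumptions made for the example.

```python
import random
from itertools import product


def simulate_copying(copier, source, percent, rng):
    """Return the copier's responses after copying `percent`% of the items from the source."""
    n_items = len(copier)
    n_copy = n_items * percent // 100
    copied_items = sorted(rng.sample(range(n_items), n_copy))
    altered = list(copier)
    for i in copied_items:
        altered[i] = source[i]  # the copier's response is replaced by the source's response
    return altered, copied_items


# The four fully crossed factors described in the text.
SAMPLE_SIZES = (100, 500, 2000)
TEST_LENGTHS = (40, 80)
SOURCE_PERCENTILES = (90, 60)
PERCENT_COPIED = (10, 20, 30, 40)
CONDITIONS = list(product(SAMPLE_SIZES, TEST_LENGTHS, SOURCE_PERCENTILES, PERCENT_COPIED))

# Toy response vectors (assumed) that make the copied items easy to see.
rng = random.Random(1)
source = [1] * 40
copier = [2] * 40
altered, copied_items = simulate_copying(copier, source, 40, rng)
```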

The data used in this study share the following features with the data used 
by Wollack (1997): [1] the copier copied from a more able source; [2] the number of 
copiers in each dataset and the percentages copied were the same; and [3] the same item 
parameters and distributional assumption were used for the θ parameters. 

A difference with Wollack (1997) is that we did not use a seating chart to identify 
the s - c pair. We assumed that there is a suspicion that c copied the answers from s. 







The K-index and the ω statistic were then used to assess the evidence that copying had 
occurred for a particular s-c pair of examinees. Thus, we did not use the statistics as a 
screening device. Wollack (1996) pointed out that in situations where there is only one 
source, ω has the highest power. 

Data Analysis 

Relationship Between p* and $Q_r$ 

Recall that $Q_r$ (r = 1, ..., R) denotes the proportion of wrong answers in each 
number-incorrect score group. For each score group r, we computed the binomial 
probability $p^*_r$ using equation (3) with $\bar{m}_d$ replaced by $\bar{m}_r$, which is 
the mean of the empirical agreement distribution for subgroup r. To explore the relationship 
between $Q_r$ and p*, we first created scatterplots for p* and $Q_r$. The information 
derived from visual inspection of these scatterplots suggested the kind of regression models 
to be fitted. On the basis of the results discussed below and on the empirical results 
obtained by Holland (1996), two standard linear regression models were proposed: 
(a) $p^*_1 = \beta_0 + \beta_1 Q_r + \varepsilon_r$ and 
(b) $p^*_2 = \beta_0 + \beta_1 Q_r + \beta_2 Q_r^2 + \varepsilon_r$, where $\beta_0$ and 
$\beta_1$ are the intercept and slope parameters, respectively, $\beta_2$ is a regression 
parameter that indicates the direction and amount of curvature, and $\varepsilon_r$ is an 
error term that is assumed to have a normal distribution with mean 0 and constant variance 
$\sigma^2$. The fit of the two models was determined using the coefficient of 
multiple determination ($R^2$) and the magnitude of the residual standard error (RSE; see 
Neter et al., 1996). $R^2$ measures the proportionate reduction of total variation in p* 
associated with the use of $Q_r$. The model with the largest $R^2$ and the smallest RSE was 
preferred. 
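
The model comparison described above can be sketched in Python with ordinary least squares. The sketch uses only the standard library and solves the normal equations directly; the function names are hypothetical and the data are synthetic values generated from an assumed quadratic curve, not results from this study.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_polynomial(q, p, degree):
    """Least-squares fit of p on powers of q; returns (coefficients, R^2, RSE)."""
    cols = degree + 1
    xtx = [[sum(x ** (i + j) for x in q) for j in range(cols)] for i in range(cols)]
    xty = [sum(y * x ** i for x, y in zip(q, p)) for i in range(cols)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x ** k for k, b in enumerate(beta)) for x in q]
    sse = sum((y - f) ** 2 for y, f in zip(p, fitted))
    mean_p = sum(p) / len(p)
    sst = sum((y - mean_p) ** 2 for y in p)
    r_squared = 1 - sse / sst
    rse = (sse / (len(q) - cols)) ** 0.5  # residual standard error
    return beta, r_squared, rse


# Synthetic illustration (assumed): p* follows an exact quadratic in Q_r.
q_values = [0.1 * k for k in range(1, 10)]
p_values = [0.05 + 0.6 * x - 0.3 * x ** 2 for x in q_values]
beta_lin, r2_lin, rse_lin = fit_polynomial(q_values, p_values, degree=1)
beta_quad, r2_quad, rse_quad = fit_polynomial(q_values, p_values, degree=2)
```

With these synthetic data the quadratic model reproduces the generating curve, so it has the larger $R^2$ and the smaller RSE, which is the preference rule stated above.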

Type I Error and Detection Rates 

For a given α, a simulee was identified as a copier when the value of the K-index 
was less than or equal to α. For the ω statistic, a simulee was identified as a copier when 
the value of ω was above the one-tailed critical value corresponding to the upper α of the 
standard normal curve. In this study, assuming suspicion of a specific simulee copying 
from a specific source, the ω statistic was tested for significance without adjustment of 
the α level. The levels α = .0001, .0005, .001, .0025, .005, and .01 were used. These values 
were also used in Wollack (1997). 

To investigate the empirical Type I error rate, we simulated tests of 40 and 80 items 
for 100, 500, and 2000 persons and we computed the number of times a true noncopier 
was incorrectly identified as a copier. We used 100 replications. Similarly, the detection 
rate was investigated by taking the proportion of replications where the true copier c was 
detected. 
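
The two decision rules and the computation of an empirical rate can be sketched in Python as follows. The function names are hypothetical and the flags in the example are made-up values used only to show the calculation.

```python
from statistics import NormalDist

ALPHAS = [.0001, .0005, .001, .0025, .005, .01]


def flag_by_k(k_value: float, alpha: float) -> bool:
    """Flag a simulee as a copier when the K-index is at most alpha."""
    return k_value <= alpha


def flag_by_omega(omega_value: float, alpha: float) -> bool:
    """Flag a simulee when omega exceeds the upper-alpha standard normal critical value."""
    critical = NormalDist().inv_cdf(1 - alpha)
    return omega_value > critical


def empirical_rate(flags) -> float:
    """Proportion of flagged cases: the Type I error rate for noncopiers,
    the detection rate for true copiers."""
    return sum(flags) / len(flags)


# Made-up illustration: one of four honest simulees is flagged.
rate = empirical_rate([True, False, False, False])
critical_values = {a: NormalDist().inv_cdf(1 - a) for a in ALPHAS}
```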



Results 

Relationship Between $p^*_r$ and $Q_r$ 

Scatterplots of p* and $Q_r$ were investigated for different sample sizes and numbers 
of items. Results are shown in Figure 1. For sample size J = 100 (Figure 1a-b), the 
relationship seems to be linear, but for sample size J = 500 (Figure 1c-d) p* initially 
increases as $Q_r$ increases, then levels off at approximately $Q_r$ = 0.6, and tends to 
decrease. For 2000 examinees (Figure 1e-f) it is clear that the relationship is curvilinear. 

Quantitative assessment of the fit of the linear and quadratic regression models in 
terms of $R^2$ and RSE revealed that the model that included the quadratic term had a 
better fit, that is, a larger $R^2$ and a smaller RSE. For example, for J = 500 and I = 40 
(Figure 1c), the value of $R^2$ for the linear fit is 0.6 (RSE = 0.03), whereas including 
$Q_r^2$ increases the value of $R^2$ to 0.66 (RSE = 0.03). Similar observations applied for 
J = 2000. Note that despite the relatively small value of $R^2$ for J = 100, the fit of 
the quadratic model is still better than that of the linear model. In general, p* is 
estimated more accurately when the quadratic term is included. 

Empirical and Binomial Agreement Distributions 

For a particular choice of the source and the subject, several agreement distributions 
were constructed for the empirical K-index and for K*, $\bar{K}_1$, and $\bar{K}_2$ based 
on the three versions of the binomial distribution ($p^*_d$, $p^*_1$, and $p^*_2$). Results 
for different sample sizes were similar, so we present in Figure 2 a typical example of 
these distributions for sample size 500. 

In general, the empirical distribution (Figure 2a) tends to have a larger upper tail 
(negatively skewed), whereas the binomial distribution shown in Figure 2d consistently has 
a smaller upper tail. Note that the size of the upper tail of the distribution greatly 
influences the value of the K-index. As can be seen from equations (1), (4), (6), and (7), 
the K-index is computed as the sum of the upper-tail probabilities. This implies that 
the distribution with the smallest upper tail yields the smallest numerical values of the 
K-index and thus provides the strongest evidence of answer copying. Since the empirical 
agreement distribution has a larger upper tail, it is expected that the K-index computed 
from this distribution will be large and thus will have low detection rates. 

Further, we found that the empirical distribution had the largest upper tail when the 
number of simulees was smallest, that is for J = 100 (graph not presented here). Thus, 
for J = 100, the K-index based on equation (1) is expected to be too conservative. 

Type I Error Rate 

Figure 3 shows the graphical comparison of the empirical Type I error rates of the 
K-index and ω across combinations of sample sizes and numbers of items. Type I error 
rates that are on the identity (boundary) line represent perfect Type I error control, Type 
I error rates above the boundary line are larger than the nominal values, and those below it 
are smaller than the nominal values. $\bar{K}_1$ and $\bar{K}_2$ (denoted in Figure 3 as K1 
and K2, respectively) are K-indices based on equations (6) and (7), whereas K* is based on 
equation (4). The Type I error rate of the K-index based on equation (1) was found to be 
much below the nominal α level and is not presented here. 

The K-indices were able to control the Type I error rates below the nominal α 
level in all situations considered. In most cases, ω was also able to control its Type I 
error rate below the nominal level, with the exception of the 80-item test with 500 and 2000 
simulees, for which the Type I error rate of ω exceeded its nominal level by approximately 
.005 (see Figures 3d and 3f). 

We also investigated the variance of the K-index and ω across replications. The 
variance of the K-index decreased with increasing percentage of copied answers, sample 
size, and number of items. The variance of ω decreased with increasing percentage of 
copying but, unlike the K-index, ω does not seem to be sensitive to changes in sample size 
and number of items. The variance of the K-index was almost equal to that of ω for longer 
tests, large numbers of examinees, and a large percentage of copying. For example, for an 
80-item test and 100 examinees, the variances of $\bar{K}_1$, $\bar{K}_2$, and ω for 10% 
copying are .0955, .0939, and .0704, respectively. As the percentage of copying increases 
to 40%, the three variances decrease to .003, .003, and .002, respectively. 

Detection Rate 

The detection rates of K*, $\bar{K}_1$, $\bar{K}_2$, and ω as a function of α level for 
different percentages of copying, sample sizes, and test lengths were first investigated for 
the source fixed at the 90th percentile. The K-index based on equation (1) was not included 
in the current analysis because its detection rate was extremely low. Figure 4 shows the 
detection rates for 100 simulees on the 40-item test and Figure 5 for 500 simulees on the 
80-item test. The detection rates for the other simulated configurations were similar and 
are not presented here. 

In almost all simulated datasets, ω had the highest detection rate. The difference 
between the detection rates of ω and the K-indices is relatively large for small sample 
size and test length but tends to diminish as the sample size and test length increase. 
For example, the difference in detection rate between ω and $\bar{K}_2$ is 0.15 for J = 100, 
I = 40, and 40% copying (see Figure 4a) and it reduces to 0.02 for J = 500, I = 80, and 
40% copying (see Figure 5a). The K-index based on the binomial distribution where p 
was estimated using linear regression with the quadratic term included ($\bar{K}_2$) 
appeared to be slightly better than $\bar{K}_1$. As expected, K* had the lowest detection 
rate. 

Further note that the detection rates of ω and the K-indices increased with the 
percentage of copied answers. Thus, examinees who copied many items are more likely 
to be detected than examinees who copied few items. 
The probability of detecting a copier who copied 10% of the items is very low: at 
most .08 for ω and less than .05 for the K-indices (see Figures 4d and 5d). 

Increasing the number of simulees had no substantial effect on the detection rates of 
ω. This is expected since the computation of ω depends only on the response patterns of 
the source and the copier and not on other examinees. On the other hand, the detection 
rates of the K-indices increased with the sample size and number of items. For example, 
for 40% copying the detection rate of $\bar{K}_2$ is 0.69 for J = 100 and I = 40 (see 
Figure 4a) and it increased to 0.92 for J = 500 and I = 80 (see Figure 5a). 

To investigate the influence of the proficiency level of the source, we also 
investigated the detection rates of the indices when the source was at the 60th percentile 
rank. Results are shown in Figure 6 for 100 simulees and a 40-item test. Comparing 
Figure 6 with Figure 4 revealed a slight increase in the detection rates of ω, $\bar{K}_1$, 
and $\bar{K}_2$ for 40% and 30% copying, but for 20% and 10% copying the detection rates 
were almost the same; the detection rate of K* substantially increased for 40% copying but 
not for the other percentages of copying. Comparing the indices within Figure 6 revealed 
that ω still maintains the highest detection rate, followed by $\bar{K}_1$ and $\bar{K}_2$, 
which are close to each other, and then by K*. 



Discussion 

In this study we investigated the statistical properties of the K-index and compared 
its detection rate with the detection rate of the ω statistic. The practical usefulness of these 
statistics will depend on the application at hand. As was shown in this study, the use of 
these indices need not be restricted to large-scale testing but can also be applied for small 
samples consisting of 100 examinees. As others have discussed, these indices can be used 
to obtain additional evidence for answer copying when a proctor has observed irregular 
behavior. An alternative is to use these indices for routine monitoring of test responses 
to prevent copying or for triggering the need to employ such measures. For example, a 
faculty member can privately inform a pair of examinees with a very strong index value of 
its occurrence and suggest that they not sit together on subsequent tests. 

Results showed that, in general, the binomial success probability, p, is better estimated 
by a quadratic function than by a linear function of the proportion of wrong answers, Q. 
However, when the dataset is large (J = 2000), the relationship between p and Q was 
nearly linear at the lower end of Q (e.g., Q < 0.6). This finding is consistent with the 
findings of Holland (1996), who used the linear function to estimate p by Q. In his study, 
he used ETS data for which the source and the copier generally belonged to the upper end 
of the ability continuum (e.g., few wrong answers or a low value of Q). 

When using the K-index for small datasets (J = 100), it is not advisable to use the 
empirical agreement distribution or its binomial approximation based on equation (4). 
In terms of distributional shape, the empirical agreement distribution was negatively 
skewed, whereas the binomial distributions, especially the one with the smallest upper tail 
(Figure 2d), were positively skewed. This resulted in a larger numerical value of the 
K-index despite the higher percentage of answers copied by the copier. 

Results further showed that all approximations of the K-index were able to hold the 
Type I error rates below the nominal level in all situations simulated. Thus, the K-index 
has more favorable statistical properties than the $g_2$ index (Frary et al., 1977), which failed 
to control the nominal Type I error rates (Wollack, 1996). 

Although ω had higher detection rates than $\bar{K}_1$ and $\bar{K}_2$ for sample sizes of 
100 and 500, the differences in detection rates are small using 2000 simulees. It is 
expected that using more than 2000 simulees, the detection rates of $\bar{K}_1$ and 
$\bar{K}_2$ will further improve. We do not recommend using K* in practice, whereas 
$\bar{K}_2$ might be a good alternative if for some reason it is not possible to use ω. 

Finally, the random variable M is a non-negative count of matching incorrect 
answers. For future study, it may be important to investigate the fit of a Poisson 
distribution as an alternative distribution for the random variable M. Furthermore, the 
weighted matching correct answers between the source and the copier can be included in 
the computation of the copying index. The weight may be taken as some function of the 
probability of a correct response. Incorporating the weighted matching correct answers in 
addition to the matching incorrect answers would differentiate this new index from the 
K-index. Also, several measures can be investigated to minimize the impact of discreteness 
due to small sample size. 